Exploiting Loop-Level Parallelism for SIMD Arrays Using OpenMP
Abstract
Programming SIMD arrays in languages such as C or FORTRAN is difficult, and although work on automatically parallelizing compilers has achieved much, the results are far from satisfactory; in particular, almost all ‘fully’ automatic parallelizing compilers place fundamental restrictions on the input language. OpenMP offers an alternative approach to parallel programming that supports incremental improvement of applications, placing restrictions only where needed to preserve the semantics of a parallelized section. However, OpenMP is limited to a thread-based model and does not define a mapping onto other parallel programming models. In this paper, we describe an alternative approach to programming SIMD machines using an extended subset of OpenMP (OpenMP SIMD), which allows us to model naturally the programming of arbitrarily sized SIMD arrays while retaining the semantic completeness of both OpenMP and the parent languages. Using the CSX architecture, we show how OpenMP SIMD can be implemented, and we discuss future extensions to both the language and SIMD architectures to better support explicit parallel programming.
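To make the loop-level starting point concrete, the following is a minimal sketch (not code from the paper) of the kind of OpenMP data-parallel loop that an OpenMP SIMD implementation would map onto the processing elements of a SIMD array rather than onto threads; the function and variable names are illustrative.

```c
#include <stddef.h>

/* Illustrative loop-level parallel kernel. Under the usual thread-based
 * OpenMP model the iterations are divided among threads; under a SIMD
 * mapping, blocks of iterations would instead be assigned to the
 * processing elements of the array. */
void saxpy(size_t n, float a, const float *x, float *y)
{
    #pragma omp parallel for
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

The point of the paper's approach is that a loop of this form carries the parallelism explicitly, so no restriction on the input language is needed beyond preserving the semantics of the parallelized section.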
Similar papers
Explicit Vector Programming with OpenMP 4.0 SIMD Extensions
Modern CPU and GPU processors integrate SIMD execution units on-die to achieve higher performance and power efficiency, which poses the challenge of using the underlying SIMD hardware (or VPUs, Vector Processing Units) effectively. Wide vector registers and SIMD instructions (single instructions operating on multiple data elements packed in wide registers), such as AltiVec [2], SSE, AVX [10] ...
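As a brief illustration of the explicit vector programming this paper describes, the sketch below uses the OpenMP 4.0 `omp simd` construct: the directive asserts that iterations may execute in SIMD lanes, and the reduction clause tells the compiler how per-lane partial results combine. The function and names are assumptions for the example, not taken from the paper.

```c
#include <stddef.h>

/* Explicit vectorization with OpenMP 4.0: the loop is declared safe
 * to run across SIMD lanes, with `sum` combined by addition. */
float dot(size_t n, const float *a, const float *b)
{
    float sum = 0.0f;
    #pragma omp simd reduction(+:sum)
    for (size_t i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}
```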
SIMD parallel MCMC sampling with applications for big-data Bayesian analytics
We present a single-chain parallelization strategy for Gibbs sampling of probabilistic Directed Acyclic Graphs, where contributions from child nodes to the conditional posterior distribution of a given node are calculated concurrently. For statistical models with many independent observations, such parallelism takes a Single-Instruction-Multiple-Data form, and can therefore be efficiently imple...
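The pattern described here maps naturally onto a SIMD reduction: with conditionally independent observations, the log of a node's conditional posterior is a sum of per-observation terms. The following is a hedged sketch of that idea only; the Gaussian likelihood and all names are assumptions for illustration, not the paper's model or code.

```c
#include <stddef.h>

/* Log conditional posterior of a parameter `theta` given n independent
 * observations y[i]: a per-observation sum computed in SIMD lanes.
 * A unit-variance Gaussian term stands in for the real model. */
double cond_log_posterior(double theta, const double *y, size_t n,
                          double log_prior)
{
    double loglik = 0.0;
    #pragma omp simd reduction(+:loglik)
    for (size_t i = 0; i < n; i++) {
        double r = y[i] - theta;
        loglik += -0.5 * r * r;
    }
    return log_prior + loglik;
}
```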
Overlapping communication and computation with OpenMP and MPI
Machines composed of a distributed collection of shared-memory or SMP nodes are becoming common for parallel computing. OpenMP can be combined with MPI on many such machines. Motivations for combining OpenMP and MPI are discussed. While OpenMP is typically used for exploiting loop-level parallelism, it can also be used to enable coarse-grain parallelism, potentially leading to less overhead. We s...
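A common hybrid pattern in this vein is sketched below, under assumptions of my own rather than the paper's: non-blocking MPI halo exchange is posted first, an OpenMP loop updates interior points that need no remote data while the messages are in flight, and only then are the halos awaited. The kernels `compute_interior` and `compute_boundary` are hypothetical.

```c
#include <mpi.h>

void compute_interior(double *u, int i);             /* hypothetical kernels */
void compute_boundary(double *u, double lo, double hi, int n);

/* One step of a 1-D decomposition: overlap the halo exchange with
 * the OpenMP update of interior points. */
void step(double *u, double *halo_lo, double *halo_hi, int n,
          int lo_rank, int hi_rank, MPI_Comm comm)
{
    MPI_Request req[4];
    MPI_Irecv(halo_lo, 1, MPI_DOUBLE, lo_rank, 0, comm, &req[0]);
    MPI_Irecv(halo_hi, 1, MPI_DOUBLE, hi_rank, 1, comm, &req[1]);
    MPI_Isend(&u[0],     1, MPI_DOUBLE, lo_rank, 1, comm, &req[2]);
    MPI_Isend(&u[n - 1], 1, MPI_DOUBLE, hi_rank, 0, comm, &req[3]);

    #pragma omp parallel for            /* interior work overlaps the exchange */
    for (int i = 1; i < n - 1; i++)
        compute_interior(u, i);

    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
    compute_boundary(u, *halo_lo, *halo_hi, n);
}
```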
Strategies for the efficient exploitation of loop-level parallelism in Java
This paper analyzes the overheads incurred in the exploitation of loop-level parallelism using Java Threads and proposes some code transformations that minimize them. The transformations avoid the intensive use of Java Threads and reduce the number of classes used to specify the parallelism in the application (which reduces the time for class loading). The use of such transformations results in...
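The paper's transformations target Java Threads, but the underlying idea, paying thread-management cost once rather than per loop, has a direct OpenMP analogue, sketched here in C under my own assumptions: a single persistent parallel region encloses several work-sharing loops instead of forking and joining threads for each one.

```c
/* Threads are created once for the whole region; each `omp for`
 * shares iterations among them, with an implicit barrier between
 * phases, avoiding repeated thread creation per loop. */
void smooth(float *a, float *b, int n, int iters)
{
    #pragma omp parallel
    for (int t = 0; t < iters; t++) {
        #pragma omp for
        for (int i = 1; i < n - 1; i++)
            b[i] = 0.5f * (a[i - 1] + a[i + 1]);
        #pragma omp for
        for (int i = 1; i < n - 1; i++)
            a[i] = b[i];
    }
}
```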
Optimization Strategies for WRF Single-Moment 6-Class Microphysics Scheme (WSM6) on Intel Microarchitectures
Optimizations in the petascale era require modifications of existing codes to take advantage of new architectures with large core counts and SIMD vector units. This paper examines high-level and low-level optimization strategies for numerical weather prediction (NWP) codes. These strategies employ thread-local structures of arrays (SOA) and an OpenMP directive such as OMP SIMD. These optimizati...
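To show why the structure-of-arrays (SOA) layout pairs with OMP SIMD, here is a small sketch; the field names are invented for illustration and are not taken from WSM6. In the SoA form each field is a contiguous, unit-stride array, which the `omp simd` loop can vectorize cleanly, whereas an array-of-structures layout would force strided loads.

```c
#define N 1024

struct cell_aos { float qv, qc, qr; };   /* AoS: strided, awkward to vectorize */

struct cells_soa {                       /* SoA: each field unit-stride */
    float qv[N], qc[N], qr[N];
};

/* Hypothetical microphysics-style update over contiguous SoA fields. */
void condense(struct cells_soa *c, float rate)
{
    #pragma omp simd
    for (int i = 0; i < N; i++) {
        c->qc[i] += rate * c->qv[i];
        c->qv[i] -= rate * c->qv[i];
    }
}
```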